NLPashto: NLP Toolkit for Low-resource Pashto Language

Abstract

In recent years, natural language processing (NLP) has transformed numerous domains, becoming a vital area of research. However, the focus of NLP studies has predominantly centered on major languages like English, inadvertently neglecting low-resource languages such as Pashto. Pashto, spoken by a population of over 50 million worldwide, remains largely unexplored in NLP research, lacking off-the-shelf resources and tools even for fundamental text-processing tasks. To bridge this gap, this study presents NLPashto, an open-source, publicly accessible NLP toolkit specifically designed for Pashto. The initial version of NLPashto introduces four state-of-the-art models for Spelling Correction, Word Segmentation, Part-of-Speech (POS) Tagging, and Offensive Language Detection. It also includes essential resources such as pre-trained static word embeddings: Word2Vec, fastText, and GloVe. Furthermore, we have trained a monolingual model for Pashto from scratch, using the Bidirectional Encoder Representations from Transformers (BERT) architecture. For the training and evaluation of all models, we developed several benchmark datasets and included them in the toolkit. Experimental results demonstrate that the models exhibit satisfactory performance on their respective tasks. This study can be a significant milestone and will hopefully support and speed up future research in the field of Pashto NLP.

Similar resources

Speech translation for low-resource languages: the case of Pashto

We present a number of challenges and solutions that have arisen in the development of a speech translation system for American English and Pashto, highlighting those specific to a very low resource language. In particular, we address issues posed by Pashto in the areas of written representation, corpus creation, speech recognition, speech synthesis, and grammar development for translation.

EstNLTK - NLP Toolkit for Estonian

Although there are many tools for natural language processing tasks in Estonian, these tools are very loosely interoperable, and it is not easy to build practical applications on top of them. In this paper, we introduce a new Python library for natural language processing in Estonian, which provides a unified programming interface for various NLP components. The ESTNLTK toolkit provides utiliti...

Language Resource Development at DLSU-NLP Lab

In 2003, the Department of Science and Technology awarded a 5-million-peso grant to De La Salle University for the development of an English-Filipino Machine Translation System. Faced with limited resources for the Filipino language, the team has to build language resources and develop language tools in order to complete the system. This paper presents the different resources and tools created an...

NLP (Natural Language Processing) for NLP (Natural Language Programming)

Natural Language Processing holds great promise for making computer interfaces that are easier to use for people, since people will (hopefully) be able to talk to the computer in their own language, rather than learn a specialized language of computer commands. For programming, however, the necessity of a formal programming language for communicating with a computer has always been taken for gr...

Sublexical Translations for Low-Resource Language

Machine Translation (MT) for low-resource languages has low-coverage issues due to Out-Of-Vocabulary (OOV) words. In this research we propose a method using sublexical translation to achieve wide coverage in Example-Based Machine Translation (EBMT) from English to Bangla. For sublexical translation we divide the OOV words into sublexical units for getting translation candidates. Previous ...

Journal

Journal title: International Journal of Advanced Computer Science and Applications

Year: 2023

ISSN: 2158-107X, 2156-5570

DOI: https://doi.org/10.14569/ijacsa.2023.01406142